Fix exception ref-count leak in non-ref catch handlers.#12860
Fix exception ref-count leak in non-ref catch handlers.#12860cfallin wants to merge 2 commits intobytecodealliance:mainfrom
Conversation
When a Wasm `throw` instruction executes, the `throw_ref` libcall was cloning the GC ref (incrementing the refcount), but the catch handler never decremented it. This caused every caught exception to leak, leading to unbounded GC heap growth in throw/catch loops. Two fixes: 1. Remove the unnecessary `clone_gc_ref()` in `throw_ref`. The `throw`/`throw_ref` instructions consume the exnref operand, so ownership transfers naturally to `pending_exception` without cloning. 2. In `create_catch_block`, emit a `drop_gc_ref` call for non-ref catches (`Catch::One`, `Catch::All`) after field extraction. These catches consume the exnref without passing it to the branch target, so the refcount must be decremented. Also adds `Store::gc_heap_size()` / `StoreContext::gc_heap_size()` accessors and a `throw_catch_many_times` integration test that throws and catches 100K exceptions in a loop, asserting the GC heap stays within a single 64 KiB page.
| /// Drop a GC reference to an exception object, decrementing its ref count. | ||
| /// | ||
| /// This should be called when a `catch` (non-ref) handler has finished | ||
| /// extracting fields from the caught exnref and no longer needs it. | ||
| pub fn translate_drop_exnref( | ||
| func_env: &mut FuncEnvironment<'_>, | ||
| builder: &mut FunctionBuilder<'_>, | ||
| exnref: ir::Value, | ||
| ) -> WasmResult<()> { | ||
| log::trace!("translate_drop_exnref({exnref:?})"); | ||
| match func_env.tunables.collector { | ||
| #[cfg(feature = "gc-drc")] | ||
| Some(Collector::DeferredReferenceCounting) => { | ||
| let drop_gc_ref_libcall = func_env.builtin_functions.drop_gc_ref(builder.func); | ||
| let vmctx = func_env.vmctx_val(&mut builder.cursor()); | ||
| builder.ins().call(drop_gc_ref_libcall, &[vmctx, exnref]); | ||
| } | ||
| // The null collector doesn't track ref counts, so nothing to do. | ||
| _ => { | ||
| // Avoid unused-parameter warning. | ||
| let _ = builder; | ||
| } | ||
| } | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
A couple things:
- If we add a new collector that requires some kind of explicit drop here, we will silently repeat this bug because of the wildcard. We shouldn't have wildcards for this kind of thing.
- Even better would be to move this into a trait method on the GC compiler, as the dual of
GcCompiler::alloc_exn.
| } else { | ||
| // For non-ref catches, the exnref is consumed here and not | ||
| // passed to the branch target. | ||
| environ.translate_drop_exnref(builder, exn_ref)?; |
There was a problem hiding this comment.
I don't quite understand. The GC ref should logically be held alive by the Wasm stack (in the over-approx table) and then get dropped upon the next GC. Why wasn't that happening? Are we missing a call to expose_gc_ref_to_wasm in the throw code? (Probably)
If so, then we should do that instead.
There was a problem hiding this comment.
We do call expose_gc_ref_to_wasm here when we manually convert to a wasmtime::ValRaw but you're right, I think it's missing in the throw path. I'll dig deeper -- thanks!
|
So digging into this a bit more I see that we always The root issue is actually that the growth heuristic for the GC heap creates the appearance of a leak (and, arguably as we play with semantics, maybe a leak in practice?) The problem I'm trying to solve is that I want setjmp/longjmp with exceptions to be robust and lightweight; that's what the test in this PR is made to emulate. If we continually throw and catch, without holding the exnref, we have only at most one object live at a time. A theoretically ideal GC should somehow see that and keep the working-set size small. I had short-circuited the "deferred" part of DRC here with explicit drops which I agree is the wrong approach now that I dig in more. But it should also be the case that a C program using SjLj with exceptions should not grow the GC heap to its max size (4GiB say?) before ever collecting -- that seems like an objectively bad heuristic choice. Perhaps what we need is a collection heuristic like what we do for |
It needs to be called every time you pass a reference from the host, to Wasm. For example, it looks like it is missing here: wasmtime/crates/wasmtime/src/runtime/vm/libcalls.rs Lines 1740 to 1753 in 5d52f56 I'm pretty sure that adding a call to
Right, I agree that we need to fix the heuristics for heap growth vs collection, but I don't think we should be special-casing Instead, we should fix the heuristics and make it so that In the meantime, you can update the test to use a resource limiter, custom memory creator, or the pooling allocator to constrain the max size of memories (and therefore also the GC heap) and it should still exercise the leak-checking even without the grow-or-collect heuristics fixed. The pooling allocator is probably the easiest. |
I'm a bit confused: the code that you link is a libcall that takes a ref from Wasm and pulls it into the host (to save on the store). Did you mean to link somewhere else? Separately: writing a variant of this test, that allocates a GC struct and immediately drops it in a loop, causes the GC heap to grow without collection up to its size limit in the same way. That seems to confirm to me that this is not exn-specific but rather a general GC growth heuristic problem?
Yes, agreed; the next paragraph of my earlier comment suggests a growth heuristic (for the whole GC heap, not just for exnref-specific cases), exactly as you're saying. (In other words: I am definitely not proposing an exceptions-specific change.) I would imagine it could/should become the default growth heuristic. What do you think of the proposed heuristic? |
|
Ah sorry I got confused because I had assumed that this was not about the heuristics and was something specific to the interaction between
Yes, this is roughly what I've been imagining, although there is the minor wrinkle of growing in units of pages, but wanting to do the accounting at a finer-grained level so that the single-page GC heap use case works correctly. To be precise, what I've had in my head is this:
This should also avoid doing a bunch of GCs for each ~power of two on the way up to 64KiB, which is just wasting cycles. How does that sound? |
And because refcounting operates on anti-matter rather than matter, this does unfortunately mean that the DRC collector will need to always have a running count of allocated bytes, rather than being able to compute it just at GC time like the copying collector will be able to. Ah well, not a big deal. |
|
That does sound reasonable, and I'm happy to work on that -- thanks! Closing this in the meantime (I'll bring back both the sjlj exception test and struct-alloc-then-drop test once we have said heuristic and assert they both stay within one page). |
When a Wasm
throwinstruction executes, thethrow_reflibcall was cloning the GC ref (incrementing the refcount), but the catch handler never decremented it. This caused every caught exception to leak, leading to unbounded GC heap growth in throw/catch loops.Two fixes:
Remove the unnecessary
clone_gc_ref()inthrow_ref. Thethrow/throw_refinstructions consume the exnref operand, so ownership transfers naturally topending_exceptionwithout cloning.In
create_catch_block, emit adrop_gc_refcall for non-ref catches (Catch::One,Catch::All) after field extraction. These catches consume the exnref without passing it to the branch target, so the refcount must be decremented.Also adds
Store::gc_heap_size()/StoreContext::gc_heap_size()accessors and athrow_catch_many_timesintegration test that throws and catches 100K exceptions in a loop, asserting the GC heap stays within a single 64 KiB page.